E is for Exploratory Data Analysis: Images
What is Exploratory Data Analysis (EDA), why is it done, and how do we do it in Python?
While my previous posts outlined methods for conducting EDA for numeric data as well as categorical data, this post focuses on EDA for images.
What is Exploratory Data Analysis (EDA)?
Again, since all learning is repetition, EDA is a process by which we 'get to know' our data by conducting basic descriptive statistics and visualizations.
Why is it done for images?
We need to know:
- how many images we have
- if we're doing supervised learning, whether they are labeled appropriately
- their format (i.e. size and color)
How do we do it in Python?
Step 1: Frame the Problem
"Is it possible to determine the minimum age a reader should be for a given book based solely on the cover?"
Step 2: Get the Data
As mentioned in my previous posts, I sourced labeled training data from Common Sense Media's Book Reviews by scraping and saving the target pages with BeautifulSoup, and then extracted and saved the book covers into a separate folder.
In the end, I was able to use over 5000 covers for training and testing purposes, but today we'll only work with a sample of the covers which can be downloaded from here.
Step 3: Explore the Data to Gain Insights (i.e. EDA)
As always, import the essential libraries, then load the data.
import pandas as pd
import numpy as np
import os
import cv2
import ipyplot
IMAGES_PATH = "data/covers/"
image_files = list(os.listdir(IMAGES_PATH))
full_file_paths = [IMAGES_PATH+image for image in image_files]
print("Number of image files: {}".format(len(image_files)))
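One thing to watch for when listing a directory is stray non-image files (e.g., a macOS .DS_Store) sneaking into the count. A small, hypothetical helper for filtering them out might look like this (the extension list is my assumption, not part of the original workflow):

```python
import os

def filter_image_files(filenames, valid_exts=(".jpg", ".jpeg", ".png")):
    """Keep only filenames whose extension looks like an image format."""
    return [f for f in filenames
            if os.path.splitext(f)[1].lower() in valid_exts]

# A made-up directory listing to show the idea:
sample = ["13_dance-of-thieves-book-1.jpg", ".DS_Store", "notes.txt"]
print(filter_image_files(sample))  # ['13_dance-of-thieves-book-1.jpg']
```

In practice you would call it as `filter_image_files(os.listdir(IMAGES_PATH))`.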
What does our target look like?
To answer that question, we can create a data frame of the book titles and the target ages in our sample, and then plot the target.
Since I scraped the data, I know the beginning of the file name is the target age (e.g., 13 is the minimum age for the file '13_dance-of-thieves-book-1.jpg'), so we can create a data frame of the:
- file names
- full paths
- and a target column called age, created by splitting the file name on the underscore and extracting the first element
data = {'files':image_files, 'full_path':full_file_paths}
df = pd.DataFrame(data=data)
df['age'] = df['files'].str.split("_").str[0].astype('int')
df.head()
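Splitting on the underscore works here because I control the file names; for messier data, a hypothetical regex-based guard (not part of the original pipeline) could fail gracefully when a name doesn't start with digits:

```python
import re

def age_from_filename(name):
    """Return the leading integer age from names like
    '13_dance-of-thieves-book-1.jpg', or None when the name
    does not start with digits followed by an underscore."""
    match = re.match(r"^(\d+)_", name)
    return int(match.group(1)) if match else None

print(age_from_filename("13_dance-of-thieves-book-1.jpg"))  # 13
print(age_from_filename("no-age-here.jpg"))                 # None
```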
Now we can plot the age feature.
df['age'].plot(kind="hist",
               bins=range(2,18),
               figsize=(24,10),
               xticks=range(2,18),
               fontsize=16);
Thankfully, the plot above has a nearly identical distribution to the entire sample (see this post), so all is good and we can continue.
What do our covers look like?
We know the general shape of our target, but let's get a feel for what the book covers themselves look like by using the IPyPlot package.
To do so, we convert the paths to the images and the target into numpy arrays:
images = df['full_path'].to_numpy()
labels_int = df['age'].to_numpy()
and then pass them as arguments to the plot_class_representations function, which returns the first instance of each of our targets.
In other words, the function will display the first book rated for 2-year-olds, 3-year-olds, 4-year-olds, and so on, until all levels of the target are represented.
ipyplot.plot_class_representations(images=images, labels=labels_int, force_b64=True)
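Under the hood, "first instance of each target" is a simple selection; here is a rough sketch of the idea in plain Python (my own illustration, not IPyPlot's actual implementation):

```python
def first_per_class(images, labels):
    """Return the first image seen for each label, keyed by label."""
    seen = {}
    for img, lab in zip(images, labels):
        if lab not in seen:
            seen[lab] = img
    return seen

# Toy data: two covers per age level
covers = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
ages = [2, 3, 2, 3]
print(first_per_class(covers, ages))  # {2: 'a.jpg', 3: 'b.jpg'}
```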
There seems to be a correlation between the dimensions of the books and the target age; books for younger readers are more square whereas books for older readers are more rectangular.
Let's investigate that further by plotting multiple covers per age which we can do by using the plot_class_tabs function.
ipyplot.plot_class_tabs(images=images, labels=labels_int, max_imgs_per_tab=4, force_b64=True)
Hmmmmm. Could be true but we'll need more evidence to be certain.
To test this hypothesis, we can look for a correlation between the size of the cover and the target age.
What are the sizes and channels of our covers?
The size of our covers will be the height and width of our images and, importantly, the number of channels tells us whether the cover is in color (three channels) or grayscale (one).
We can compute the dimensions of the books by extracting the height and width from the shape of the images.
First create a list of arrays for the covers:
covers = [cv2.imread(IMAGES_PATH+image) for image in image_files]
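Note that cv2.imread silently returns None for files it cannot decode, which would crash the shape calls later. A defensive sketch of the loading step (the `reader` argument is just a stand-in to make the idea testable; in practice it would be `cv2.imread`):

```python
def load_covers(paths, reader):
    """Read each path with `reader`, skipping anything it cannot decode."""
    covers, skipped = [], []
    for p in paths:
        img = reader(p)
        if img is None:       # cv2.imread signals failure by returning None
            skipped.append(p)
            continue
        covers.append(img)
    return covers, skipped

# Toy reader: pretends every '.jpg' decodes and everything else fails
fake_reader = lambda p: "pixels" if p.endswith(".jpg") else None
print(load_covers(["a.jpg", "broken.bmp"], fake_reader))
```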
Congratulations! All of our covers are now stored as a list of arrays of pixel data so we can use shape to inspect the dimensions of our covers.
For example, the first cover in our collection is Dance of Thieves which looks like this:
import matplotlib.pyplot as plt
sample = df.iloc[0,1]
sample_img = cv2.imread(sample)
# OpenCV loads images in BGR order; convert to RGB so matplotlib shows true colors
sample_img = cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB)
plt.imshow(sample_img)
plt.xticks([]), plt.yticks([]) # hide tick values on the X and Y axes
plt.show()
Now to inspect the dimensions of the cover above, we call shape on it like so:
covers[0].shape
which returns a tuple.
"What does this tuple contain?"
- the height and width (i.e., rows and columns) measured in pixels
- assuming our cover is in color, the number of channels (three, for red, green, and blue; note that OpenCV actually stores them in BGR order)
Therefore, from the output above, we know Dance of Thieves is 255 pixels high by 170 pixels wide, and is (obviously) in color.
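To make the tuple concrete without loading any real files, here is a synthetic illustration (NumPy stand-ins, not actual covers); note that a grayscale image, such as one loaded with the cv2.IMREAD_GRAYSCALE flag, would have no channel axis at all:

```python
import numpy as np

# Synthetic stand-in for a 255x170 color cover: height x width x channels
color_cover = np.zeros((255, 170, 3), dtype=np.uint8)
height, width, channels = color_cover.shape
print(height, width, channels)  # 255 170 3

# A grayscale image would be 2-D: just height x width
gray_cover = np.zeros((255, 170), dtype=np.uint8)
print(gray_cover.shape)  # (255, 170)
```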
"But what about the rest of the covers?"
Glad you asked.
Get the Dimensions of All Covers
Since we know the order of the elements of the tuple, namely, height, width, channel, we can use indexing and list comprehension to get the dimensions of our covers like this:
width = [cover.shape[1] for cover in covers]
height = [cover.shape[0] for cover in covers]
channels = [cover.shape[2] for cover in covers]
and then add them to our data frame.
df['width'] = width
df['height'] = height
df['channels'] = channels
df.head()
We can also check how many unique widths and heights appear in the sample:
set(df['width'])
set(df['height'])
Now we can return to the question of whether there is a relationship between the dimensions of the covers and the target age of the reader.
df['ratio'] = df.width / df.height
df.head()
df.loc[:,['age', 'height', 'ratio']].corr()
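To see what .corr() is measuring here, consider a toy example with made-up numbers (not the real covers): if younger readers' books are squarer (ratio near 1) and older readers' books more rectangular (ratio well below 1), the age-ratio correlation comes out negative:

```python
import pandas as pd

# Entirely synthetic ages and width/height ratios for illustration
toy = pd.DataFrame({
    "age":   [3, 5, 8, 12, 14],
    "ratio": [1.00, 0.95, 0.80, 0.67, 0.66],  # width / height
})
# Ratio falls as age rises, so the Pearson correlation is negative
print(toy["age"].corr(toy["ratio"]) < 0)  # True
```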
df.loc[:,['age', 'width']].plot()
import seaborn as sns
# Create the correlation matrix (numeric columns only)
corr = df.corr(numeric_only=True)
# Generate a mask to cover the upper triangle of the matrix
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
# Plot the heatmap with correlations
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(8, 6))
    ax = sns.heatmap(corr, mask=mask, annot=True, square=True)
Alternatively, we can collect all three dimensions in a single loop, reading each image only once:
width = []
height = []
channels = []
for image in image_files:
    img = cv2.imread(IMAGES_PATH+image)
    dims = img.shape
    height.append(dims[0])
    width.append(dims[1])
    channels.append(dims[2])
df.head()
Summary
In this series, we have now conducted EDA on:
- numeric data
- categorical data
- images (book covers)
Two down; one to go!
Going forward, my key points to remember are:
What type of categorical data do I have?
There is a huge difference between ordered (i.e. "bad", "good", "great") and truly nominal data that has no order/ranking like different genres; just because I prefer science fiction to fantasy, it doesn't mean it actually is superior.
Are missing values really missing?
Several of the features had missing values which were, in fact, not truly missing; for example, the award and awards features were mostly blank for a very good reason: the book didn't win one of the four awards recognized by Common Sense Media.
In conclusion, both of the points above can be summarized simply as "be sure to get to know your data."
Happy coding!
Footnotes
1. Adapted from the Engineering Statistics Handbook
2. Be sure to check out this excellent post by Jeff Hale for more examples of how to use this package
3. See this post on Smarter Ways to Encode Categorical Data
4. Big thank you to Chaim Gluck for providing this tip